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Abstract 

Stability Selection was recently introduced by Meinshausen and BiihlmanrJ (|2010l ) as a very general 
technique designed to improve the performance of a variable selection algorithm. It is based on aggre- 
gating the results of applying a selection procedure to subsamples of the data. We introduce a variant, 
called Complementary Pairs Stability Selection (CPSS), and derive bounds both on the expected number 
of variables included by CPSS that have low selection probability under the original procedure, and on 
the expected number of high selection probability variables that are excluded. These results require no 
(e.g. exchangeability) assumptions on the underlying model or on the quality of the original selection 
procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved 
error control, and therefore increasing the applicability of the methodology. 

Key words: Complementary Pairs Stability Selection, r-concavity, subagging, subsampling, variable 
selection 

1 Introduction 

The problem of variable selection has received a huge amount of attention over the last 15 years, motivated 
by the desire to understand structure in massive data sets that are now routinely encountered across many 
scientific disciplines. It is now very common, e.g. in biological applications, image analysis and portfolio 
allocation problems as well as many others, for the number of variables (or predictors) p that are measured 
to exceed the number of observations n. In such circumstances, variable selection is essential for model 
interpretation. 



In a notable recent contribution to the now vast literature on this topic, iMeinshausen and Biihlmann 



(|2010l) Droposed Stability Selection as a very general technique designed to improve the performance of a 



variable selection algorithm. The basic idea is that instead of applying one's favourite algorithm to the 
whole data set to determine the selected set of variables, one instead applies it several times to random 
subsamples of the data of size [?t-/2J, and chooses those variables that are selec ted most f requently o n the 



subsamples . Stability Selection is th erefore intimately connected with bagging ( Breimanl . Il996l Il999f ) and 
subagging (jBuhlmann and "^ . l2002l) . 



A particularly attractive feature of Stability Sele ction is the error control provided b y an upper bound 
on the expected number of falsely selected variables ( Meinshausen and Biihlmann . 2010l Theorem 1). Such 



control is typically unavailable when applying the original selection procedure to the whole data set, and 
allows the practitioner to select the threshold r for the proportion of subsamples for which a variable must 
be selected in order for it to be declared significant. 

However, the bound does have a couple of drawbacks. Firstly, it applies to the 'population version' of 
the subsampling process, i.e. to the version of the procedure that aggregates results over the non-random 
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choice of all (^,"2]) subsamples. Even for n as small as 15, it is unrealistic to expect this version to be used 
in practice, and in fact choosing around 100 random subsamples is probably typical. More seriously, the 
bound is derived under a very strong exchangeability assumption on the selection of noise variables (as well 
as a weak one on the quality of the original selection procedure, namely that it is not worse than random 
guessing) . 

In this paper, we develop the methodology and conceptual understanding of Stability Selection in several 
respects. We introduce a variant of Stability Selection, where the subsamples are drawn as complementary 
pairs from {1, . . . , n}. Thus the subsampling procedure outputs index sets {{A2j~i, ^2j) ■ j = ■ ■ ■ , B}, 
where each Aj is a subset of {1, . . . , ri} of size [n/2\ , and A2j-ir\A2j = 0. We call this variant Complementary 
Pairs Stability Selection (CPSS). 

At first glance it would seem that CPSS would be expected to yield very similar results to the original 
version of Stability Selection. However, we show that CPSS in fact has the following properties: 

(i) The Meinshausen-Biihlmann bound holds for CPSS regardless of the number of complementary pairs 
B chosen - even with B — 1. 

(ii) There is a corresponding bound for the number of important variables excluded by CPSS. 

(iii) Our results have no conditions on the original selection procedure, and in particular do not require the 
strong exchangeability assumption on the selection of noise variables. Indeed, we argue that even a 
precise definition of 'signal' and 'noise' variables is not helpful in trying to understand the properties of 
CPSS, and we instead state the bounds in terms of the expected number of variables chosen by CPSS 
that have low selection probability under the base selection procedure, and the expected number of 
high selection probability variables that are excluded by CPSS. See Section [5] for further discussion. 

(iv) The bound on the number of low selection probability variables chosen by CPSS can be significantly 
sharpened under mild shape restrictions (e.g. unimodality or r-concavity) on the distribution of the 
proportion of times a variable is selected in both A2j-i and A2j. We discuss these conditions in detail 
in Sections 13.21 and 13.31 respectively, and compare both the original and new bounds to demonstrate 
the marked improvement. 

Our improved bounds are based on new versions of Markov's inequality that hold for random variables whose 
distributions are unimodal or r-concave. However, it is important to note at this point that the results are not 
just a theoretical contribution; they allow the practitioner to reduce r (and therefore select more variables) 
for the same control of the number of low selection probability variables chosen by CPSS. In Section [3^ we 
give recommendations on how a practitioner can make use of the bounds in applying CPSS. 

In Section 14.11 we present the results of an extensive simulation study designed to illustrate the appro- 
priateness of our shape restrictions, and to compare Stability Selection and CPSS with their base selection 
procedures. 

A review of some of the extensive literature on variable selection can be found in Fan and l3 (I2010I) . 



Work related more specifically to Stability Selection includes Bach ( 2008[ ). who studied the Bolasso (short for 



Bootstrapped enhanced Lasso). This involves applying the Lasso to bootstrap (with replacement) samples 
from the original data, rather than subsampling without replacement. A final estimate is obtained by 
applying the Lasso to the intersection of the set of variables selected across the bootstrap samples. Various 
authors, particularly in the machine learning literature, have considered the stability of a feature selection 
algorit hm, i.e. the insensit iv ity of the output of the algori t hm to variations in the tr aining set; such studies 
i nclud e fLange et a l! (2003*), Kalousis, Prados and Hilario (2007|), Kunch eva (2007|), iLoscalzo. Yu and DinsJ 



( 20091 ) and Han and Yu (.2010). Saevs. Abeel and Peer. (2008.) consider obtaining a final feature ranking by 



aggregating the rankings across bootstrap samples. 



2 Complementary Pairs Stability Selection 

In order to keep our discussion rather general, we only assume that we have vector- valued data zi , . . . , 
which we take to be a realisation of independent and identically distributed random elements Zi, . . . , Z„ 
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Informally, we think of some of the components of Zi as being 'signal variables', and others as being 'noise 
variables', though for our purposes it is not necessary to define these notions precisely. Formally, we let 
S C {1, . . . ,p} and N := {1, . . . ,p} \ 5, thought of as the index sets of the signal and noise variables 
respectively. A variable selection procedure is a statistic 5„ := S'„(Zi, . . . , Z„) taking values in the set of all 
subsets of {1, ... and we think of Sn as an estimator of 5*. As a typical example, we may often write 
Zi = {Xi, Yi) with the covariate Xi € MP and the response Fj e K, and our (pseudo) log-likelihood might be 
of the form 



i=l 



L{Y,.Xj(i), (1) 



for some /3 e W . In this context, we regard S := {k : f3k 7^ 0} as the signal indices, N = {k : — 0} as 
noise indices. Examples from graphical modelling can also be cast within our framework. Note however that 
we do not require a (pseudo) log- likelihood of the form ([T]). 

We define the selection probability of a variable index k E { 1 , . . . , p} under Sn as 

Pk,n=nke Sn)=E{l^^^g^y). (2) 

We take the view that for understanding the properties of Stability Selection, the selection probabilities 
Pk,n are the fundamental quantities of interest. Since an application of Stability Selection is contingent on 
a choice of base selection procedure Sn, all we can hope is that it selects variables having high selection 
probability under the base procedure, and avoids selecting those variables with low selection probability. 
Indeed this turns out to be the case; see Theorem [1] below. 

Of course, l{j,£5 j has a Bernoulli distribution with parameter pk,n, so we may view l{i,g5 -j as an 
unbiased estimator of pfe.„ (though pk^n is not a model parameter in the conventional sense). The key idea 
of Stability Selection is to improve on this simple estimator of pk,n through subsampling. 

For a subset A — {ii, . . . ,i\A\} C {1, . . . ,n} with ii < ■ ■ ■ < we shall write 

S{A) := S\A\{Zii, . . . , ^i|^|). 

Definition 1 (Complementary Pairs Stability Selection). Let {{A2j-i, A2j) : j — 1,...,B} be randomly 
chosen independent pairs of subsets of {1, . . . , n} of size [n/2j such that A2j-i D A2j — 0. For r € [0, 1], 
the Complementary Pairs Stability Selection version of a variable selection procedure Sn is S^^^^ — {k : 
ns(fc) > t}, where the function tig : {1, . . . ,p} — )■ {0, -g, . . . , 1} is given by 
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^^(''y-= ^T^h^eSiA,)}- (3) 
i=i 

Note that lis (/c) is an unbiased estimator of ^11/2] i but, in general, a biased estimator of pk,n- However, 
by means of the averaging involved in ([3]), we hope that IlB{k) will have reduced variance compared with 
l{fcg5 and that this increased stability will more than compensate for the bias incurred. Indeed, this is the 
case i n other s ituations where bagging and subagging ha ve been su ccessfully applie d, such as clas sification 
trees (iBreiman. logs') or nearest neighbour classifiers (|IIall and Samworth . ,2005, : Biau. Cerou and Guvadej . 
2010l:ISamworthLl201lh . 



An alternative to subsampling complementary pairs would be to use bootstrap sampling. We have found 
that this giv6S very similar Gstima,tGS of p/e,n; 

though most of our theoretical arguments do not apply when the 
bootstrap is used (the approach in Section 13.3.11 is an exception in this regard). In fact, taking subsamples 
of size \n/2\ can be thought of as the subsampling scheme that most closely mimics the bootstrap (e.g. 



Diimbgen. Samworth and Schuhmacheii 120111 ). 



It is convenient at this stage to define another related selection procedure based on sample splitting. 

Definition 2 (Simultaneous Selection). Let {(A2j^i, A2j) : j — 1,...,B} be randomly chosen independent 
pairs of subsets of {1, . . . ,n} of size [n/2j such that A2j-i D A2j = 0. For r G [0, 1], the Simultaneous 
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Selection 



ofSn is S^™ = {k : fleik) > t}, where 



1 ^ 



(4) 



i=i 



For our purposes, Simultaneous Selection is a tool for understan ding the properties of CPSS. H owever, 
the special case of B = 1 of Simultaneous Selection was studied bv lFan. Samworth and Wu ( 2009 ), and a 
variant involving all possible disjoint pairs of subsets was considered in Meinshausen and Biihlmann ( 2010( ). 



3 Theoretical properties 
3.1 Worst-case bounds 

In Theorem [T] below, we show that the expected number of low selection probability variables chosen by 
CPSS is controlled in terms of the expected number chosen by the original selection procedure, with a 
corresponding result for the expected number of high selection probability variables not chosen by CPSS. 
The appealing feature of these results is their generality: they require no assumptions on the underlying 
model or on the quality of the original selection procedure, and they apply regardless of the number B of 
complementary pairs of subsets chosen. 

For 9 G [0,1], let Lg — {k : Pk.[n/2\ 1^ ^} denote the set of variable indices that have low selection 
probability under S'[n/2j , and let He = {k : Pk,\n/2\ > ^} denote the set of those that have high selection 
probability. 

Theorem 1. (i) If t e (^,1], then 

Ejs^pss n Le\ < ^^nsinm n Le\. 



(ii) Let iVCPSS ^ {1, . . . ,p} \ ^CPSS and N,, ^ {1, . . . ,p}\ S,,. If t e [0, i), then 

E|iV,'r,r^ nHe\< ^—^nNin/2\ nHg\. 

In many applications, and for a good base selection procedure, we imagine that the set of selection 
probabilities {pk.[n/2] ■ k = I, . . . ,p} is positively skewed in [0, 1], with many selection probabilities being 
very low (predominantly noise variables), and with just a few being large (including at least some of the 
signal variables). To illustrate Theorem [TJi) , consider a situation with p = 1000 variables and where the 
base selection procedure chooses 50 of them. Then Theorem [TJi) shows that on average CPSS with r = 0.6 
selects no more than a quarter of the below average s election probability variables chosen by »S'|^„/2J • 



Our Theorem [IJi) is analogous to Theorem 1 of iMeinshausen and Biihlmannl (j2010t ) . The differences 



are that we do not require the condition that ^^^j : k G N} is exchangeable, nor that the original 

procedure is no worse than random guessing, and our result holds for all B. The price we pay is that the 
bound is stated in terms of the expected number of low selection probability variables chosen by CPSS, 
rather than the expected number of noise variables, which we do for the reasons described in Section [21 If 
the exchangeability and random guessing conditions mentioned above do hold, then, writing q E,\S^n/2] |j 
we recover 



The final bound here was obtained in Theorem 1 of IMeinshausen and Biihlmann ( 2010l ) for the population 
version of Stability Selection. 
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3.2 Improved bounds under unimodality 



Despite the attractions of Theorem[Tl the following observations suggest there may be scope for improvement. 
Firstly, we expect we should be able to obtain tighter bounds as B increases. Secondly, and more importantly, 
examination of the proof of Theorem [TJi) shows that our bound relies on first noting that 

l + IlB(fc) > 2flB(fc), (5) 

and then applying Markov's inequality to nB(fc). For equality in Markov's inequality, \\B{k) must be a 
mixture of point masses at and 2t — 1, but Figure [1] suggests that the distribution of nB(fc), which is 
supported on {0, -g, -|, . . . , 1}, can be very different from this. Indeed, our experience, based on extensive 
simulation studies, is that when 9 is close to q/p (which is where the bound in Theorem [IJi) is probably of 
most interest), the distribution of Hsik) over k € Lg is remarkably consistent over different data generating 
processes, and Figure [T] is typical. It is therefore natural to consider placing shape restrictions on the 
distribution of 11^ (k) which encompass what we see in practice, and which yield stronger versions of Markov's 
inequality. As a first step in this direction, we consider the assumption of unimodality. 

Theorem 2. Suppose that the distribution ofYlsik) is unimodal for each k G Lq. //r G + -g, ^ + ^ + 
then 

jCPSS 



El^^r.r ' n Lg\ < C(t, i?) 0E|^Ln/2j H Lg 



where, when 9 < l/\/3) 



C{t,B) 



2(2r - 1 - 1/2B 
4(1-T + 1/2B) 



tfre (min(i + 9^ i ^ _L ^ 3fl2^ 3i 



^2 ' " ' 2 ' 2B ' 4'' 4J 



1 + 1/5 



The proof of Theorem [2] is based on a new version of Markov's inequality (Theorem [9] in the Appendix) 
for random variables with unimodal distributions supported on a finite lattice. There is also an explicit 
expression for C{t,B) when 9 > l/\/3, which follows from Theorem |9] in the same way, but we do not 
present it here because it is a little more complicated, and because we anticipate the bound when 9 is 
(much) smaller than l/-\/3 being of most use in practice. See Section 13.41 for further discussion. 

Figure [2] compares the bounds provided by Theorems [T] and Theorem [2] as a function of r, for the 
illustration discussed after the statement of Theorem [TJ 



3.3 Further improvements under r-concavity 



The unimodal assumption allows for a significant improvement in the bounds attainable from a naive ap- 
plication of Markov's inequality. However, Figure [1] suggests that further gains may be realised by placing 
tighter constraints on the family of distributions for Tlsik) that we consider, in order to match better the 
empirical distributions that we see in practice. 

A very natural constraint to impose on the distribution oiIlB{k) is log-concavity. By this, we mean that, if 
/ denotes the probability mass function oiIlB{k), then the linear interpolant to {(i, f{i/B)) : i = 0,1, . . . , B} 
is a log-concave func tion on [0,11. L o g-concavity is a shape constraint that has received a great deal of a t- 
tention recently (e.g. Walther ( 2002 ): Diimbgen and Rufibach ( 20091) : Cule. Samworth and Stewart ( 2010t )). 
and at first sight it seems reasonable in our context, because if the summands in ^ were independent, then 
we would have Ils(fc) ^ ^Bm{B,pf. |^„/2j)j which is log-concave. 

It is indeed possible to obtain a version of Markov's inequality under log-concavity that leads to another 
improvement in the bound on EjS',*^'^^^ H Lg\. However, we found that in practice, the dependence structure 
of the summands in ^ meant that the log-concavity constraint was a little too strong. We therefore consider 
instead the class of r-concave distributions, which we claim defines a continuum of constraints that interpolate 
between log-concavity and unimodality (see Propositions |3] and |4] below) . This constraint has also been 
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Figure 1: Rows 1 to 3 show a typical example of the full probability mass function (left) and zoomed in from 0.2 
onwards (right) of n2B(A;) for k € L^/p (black), alongside the unrestricted, unimodal and — 1/2-concave distributions 
respectively (grey), which have maximum tail probability beyond 0.2. This situation corresponds to selecting r = 0.6. 
Bottom left: the observed mass function (circles) and the e&tremal —1/2-concave mass function (crosses) on the x~^^^ 
scale. Bottom right: tail probabilities from 0.2 onwards for each of the distributions. 
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Figure 2: Compari son of the bounds on Elff^^^^ H -^n/ nl for different values of the threshold t: the original bound 
from Theorem 1 of iMeinshausen and BiihlmanrJ ()2010l ) (long dashes), our worst case bound (dots and dashes), the 
unimodal bound (dots) and the r-concave bound ([8]) (short dashes). The solid line is the true value of EIS'^'^^^ nl/q/p| 
for a simulated example. In this case p = 1000, q = 50 and the number of signal variables was 8. 



studie d recentl y in the context of density estimation bv lSeregin and Wellner ( 2010( ) and Koenker and Mizeral 
see also iDharmadhikari and Joag-DevI (Il988l) . 
To define the class, we recall that the r^^ generalised mean AIr{a, b; A) of a, & > is given by 

Mria,b;X) =^ {{1 - X)a'- + Xb'-y^'- 

for r > 0. This is also well-defined for r < if we take Mr{a, b;X)~0 when ab — 0, and define 0*" — oo. In 
addition, we may define 



Mo(a, b; X) := lim MJa, b; A) = a^-^b^ 
M_oo(a, 6; A) := lim Mr(a, 6; A) = min(a, &). 

r—^ — oo 

We are now in a position to define r-concavity. 

Definition 3. A non-negative function f on an interval J C M is r-concave if for every x,y E I and 
X e (0, 1), we have 

f{{l-X)x + Xv)>Mr{f(x)J{y)-X). 

Definition 4. A probability mass function f supported on {0, -i, -g. . . . , 1} is r-concave if the linear inter- 
polant to {(i, f{i/B)) : i — 0,1, . . . , B} is r-concave. 

When r < 0, it is easy to see that / is r-concave if and only if f^ is convex. Let Jv denote the class of 
r-concave probability mass functions on {0, -g, . . . , 1}. Then each / € J> is unimodal, and as Mr{a, b; A) 
is non-decreasing in r for fixed a and b, we have J-r D J>' for r < r' . Furthermore, / is unimodal if it is — oo- 
concave, and / is log-concave if it is 0-concave. The following two results further support the interpretation 
of r-concavity for re [— cx), 0] as an interpolation between log-concavity and unimodality. 



Proposition 3. A function f is log-concave if and only if it is r-concave for every r < 0. 
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Proposition 4. Let f he a unimodal probability mass function supported on {0, -g, -g, . . . , 1} and suppose 
both that /(O) < . . . < /(-^) = /(^) = . . . = /(f) and that /(f) > f{^) >...> /(I), for some I < u. 
Then f is r-concave for some r < 0. 

In Proposition [TT] in the Appendix, we present a result that characterises those r-concave distributions 
that attain equality in a version of Markov's inequality for random variables with r-concave distributions on 
{0, -g, . . . , 1}. If we assume that tlsik) is r-concave for all k e Lg, using ([S]), for these variables we can 
obtain a bound of the form 

P(nB(fc) > r) < Z?(g,Ln/2j , 2t - 1, B, r) < D{9\2t - 1, B, r) (6) 

where D{ri,t, B,r) denotes the maximum of ¥{X > t) over all r-concave random variables supported on 
{0, . . . , 1} with E{X) < rj. Although D does not appear to have a closed form, it is straightforward 

to compute numerically, as we describe in Section IA.4I The lack of a simple form means a direct analogue 
Theorem [5] is not available. We can nevertheless obtain the following bound on the expected number of low 
selection probability variables chosen by CPSS: 

El^cpss^^^l^ Y^V{tlB{k)>T)<D{e\2T~l,B,r)\Le\. (7) 

keLo 

Our simulation studies suggest that r = —1/2 is a sensible choice to use for the bound. In other words, 
if / denotes the probability mass function of ris(/c), then the linear interpolant to {{i, f{i/ B)^^/"^) : i = 
0,1,..., i?} is typically well approximated by a convex function. This is illustrated in the bottom left panel 
of Figure [T] (note that the right-hand tail in this plot corresponds to tiny probabilities). 

3.3.1 Lovi^ering the threshold r 

The bounds obtained thus far have used the relationship ([S]) to convert a Markov bound for tlsik) into a 
corresponding one for the statistic of interest, IlB(fc). The advantage of this approach is that 'Kilisik)) = 
Pk Ln/2j much smaller than E(IlB(fc)) = Pfc,[n/2J for variables with low selection probability, so the Markov 
bound is quite tight. However, for r close to 1/2, the inequality ([5]) starts to become weak, and bounds can 
only be obtained for t > 1/2 in any case. 

To solve this problem, we can apply our versions of Markov's inequality directly to Tlsik). We have found, 
through our simulations, that for variables with low selection probability, the distribution of IlB(fc) can be 
modelled very well as a — 1/4-concave distribution (see Figure [J]). That the distribution of IlB(fc) is closer 
to log-concavity than that of Hsik) is intuitive because although the summands in ^ are not independent, 
terms involving subsamples which have little overlap will be close to independent. If we assume that JlB{k) 
is — 1/2-concave and that Ils(fc) is —1/4-concave for all k G Lg, we can obtain our best bound 

El^,'^,^ Sp^iel <mm{Di9\2T~l,B,^l/A),D{e,T,2B,-l/2)}\Lg\, (8) 

which is valid for all r e {6, 1], provided we adopt the convention that D{-, t, •, •) = 1 for t < 0. The resulting 
improvements in the bounds can been seen in Figure [21 Note the kink in Figure [2] for the r-concave bound 
(HI) just before r = 0.6. This corresponds to the transition from where D{d, r, 2B, —1/4) is smaller to where 
D{e'^,2T- 1,5,-1/2) is smaller. 

We applied the algorithm described in Section IA.4I to produce tables of values of 

min{i:>(6l^ 2t - 1, 50, -1/2), D{9, r, 100, -1/4)} 

over a grid of 6 and r values; see Table [2] and Table |3l 

3.4 How to use these bounds in practice 

The quantities \Lg\ and E|5l„/2J H Lg\, which appear on the right hand sides of the bounds, will in general 
be unknown to the statistician. Thus when using the bounds, they will typically need to be replaced by p 
and q respectively. In addition, several parameters must be selected, and in this section we go through each 
of these in turn and give guidance on how to choose them. 
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Figure 3: A typical example of the probability mass function of 1125 (fc) for k £ Lq/p (black bars and circles), alongside 
the — 1/4-concave distribution (grey bars and crosses), which has maximum tail probability beyond 0.4. 



Choice of B. We recommend B = 50 as a default value. Choosing B larger than this increases the 
computational burden, and may lead to the r-concavity assumptions being violated. 



Choice of 0. As mentioned at the beginning of Section [321 — q/p is a, natural choice. In other words, 
we regard the below average selection probability variables as the irrelevant variables. Other choices of 9 
are possible, but the use of ^ and ([7]) to construct the bound suggests that the inequality will be tightest 
when most of the variables have a selection probability close to 6. 



Choice of q and threshold t. One can regard the choice of q ^ E(|S'l„/2j|) (which is usually fixed 
through a tuning parameter A) as part of the choice of the base selection procedure. One option is to fix q 
by varying A at each evaluation of the selection procedure until it selects q variables. However, if the number 
of variables selected at each iteration is unknown in advance (e.g. if A is fixed, or if cross-validation is used 
to choose A at each iteration), then q can be estimated by X]fc=i ^^(fc). 

An important point to note is that although choosing A or q is usually crucial when carrying out variable 
selection, this is not the case when usin g CPSS. Our experience is that the pe rformance of CPSS is surprisingly 



insensitive to the choice of q (see also Meinshausen and Btihlmannl ( 2010| )). That is to say, Lq/p does not 



vary much as q varies, and also the final selected sets for different values of q tend to be similar (where 
different thresholds are chosen to control the selection of variables in L^/p at a pre-specified level). Thus, 
when using CPSS, it is the threshold t that plays a role similar to that of a tuning parameter for the base 
procedure. The great advantage of CPSS is that our bounds allow one to choose r to control the expected 
number of low selection probability variables selected. 

To summarise: we recommend as a sensible default CPSS procedure taking B = 50 and 6 = q/p. We 
then choose r using the bound ([H]) with \Lg\ replaced by p to control the expected number of low selection 
probability variables chosen. 
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4 Numerical properties 



4.1 Simulation Study 

In this section we investigate the performance and vahdity of the bounds derived in the previous section by 
applying CPSS to simulated data. We consider both linear and logistic regression and different values of p 
and n. In each of these settings, we first generate independent explanatory vectors Xi, . . . ,Xn with each 
Xi ~ Np{0, S). We use a Toeplitz covariance matrix E with entries 

y.. _ „IK-j|-p/2|-p/2 
i^ij — p , 

and we look at various values of p in [0, 1). So the correlation between the components decays exponentially 
with the distance between them in Zp. 

For linear regression, we generate a vector of errors e ^ Nn{{)^ ct^/) and set 

r = X/3 + e, 

where the design matrix X has i'^ row Xf . The error variance is chosen to achieve different values of 
the signal-to- noise ratio (SNR), which we define here by 



For logistic regression, we generate independent responses 

Bin(l,pi), j = l,...,n, 

where 

Here 7 is a scaling factor which is chosen to achieve a particular Bayes error rate. 

In both cases, we fix the p-dimensional vector of coefficients j3 to have s <^ p non-zero components, s/2 of 
which we choose as equally spaced points within [—1, —0.5] with the remaining s/2 equally spaced in [0.5, 1]. 
The indices of the non-zero components, 5', are chosen to follow a geometric progression up to rounding, 
with first term 1 and (s -I- 1)*^ term p + 1. The values are then randomly assigned to each index in S, but 
this choice is then fixed for each particular simulation setting. 

With p > 0, this setup will have several signal variables correlated amongst themselves, and also some 
signal correlated with noise. In this way, the framework above includes a very wide variety of different data 
generating processes on which we can test the theory of the previous section. 

By varying the base selection procedure, its tuning parameters, the values of p, n, p, s and also the SNR 
and Bayes error rates, we have applied CPSS in several hundred different simulation settings. For reasons 
of space, we present only a subset of these numerical experiments below, but the results from those omitted 
are not qualitatively different. 

In the graphs which follow, w e look at CPSS applied to the La.sso ( Tibshiranil. 19961). which we imple- 



ment ed using the package glmnet ( Friedman. Hastie and Tibshirani . 201o[) in R (Ir Development Core Ttea: 



2OIOI) . We follow the original stability selection procedure put forward in lMeinshausen and Biihlmannl (|201 



Teanj . 
2010h 



and compare this to the method suggested by our r-concave bound (|8]). Thus we first choose the level I 
at which we wish to control the expected number of low selection probability variables (so we aim to have 
n Lq/p\ < Vj. Then we fix q = i/0.8Zp and set the threshold r at 0.9. This ensures that, according to 
the original worst case bound, we control the expected number of low selection probability variables selected 
at the required level. In the r-concave case, we take our threshold as 

f = min{T e {0, 1/2B, . . . , 1} : min{D{q^lp\ 2r - 1, B, -1/2), D{q/p, r, 25,-1/4)} < l/p}. 
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We also give the results one would obtain using the Lasso alone, but with the benefit of an oracle which 
knows the optimal value of the tuning parameter A. That is, we take as our selected set, where 

A*=inf{A:E|S';^nL,/p|<?}, 

and is the selected set when using the Lasso with tuning parameter A applied to the whole data set. 

We present all of our results relative to the performance of CPSS using an oraclc-driven threshold r*, 
where r* is defined by 

T* = mm{T e {0, 1/2B, ...,!}: E|5,^f ^ ^ ^^^^j < 
Referring to Figures |3][71 the heights of the black bars, grey bars and crosses are given by 

ns^'of n ^1 ns^^'^nsi E|5f n s\ 

EI^CPSS n ^1 ' El^cPfS n 5| '''' EI^CPSS n ^1 ' 

respectively. Thus the heights of the black and grey bars relate to the loss of power in using the threshold 
suggested by the corresponding bounds. In all of our simulations, we used B = 50. Each scenario was 
run 500 times, and in order to determine the set ig/p, in each scenario, we applied the particular selection 
procedure 5'l„/2J to 50,000 independent data sets. 

It is immediately obvious from the results that using the r-concave bound, we are able to recover sig- 
nificantly more variables in S than when using the the worst case bound. Furthermore, though it is not 
shown in the graphs explicitly, we also achieve the required level of error control in all but one case (where 
the r-concavity assumption fails). In fact the one particular example is hardly exceptional in that we have 
E\S^^^^ nLq/p\ — 1.034 > 1 = L Thus in close accordance with our theory, there are no significant violations 
of the r-concave bound. 

We also see that the loss in power due to using f rather than r*, is very low. In almost all of the scenarios, 
we are able to select more than 75% of the signal we could select with the benefit of an oracle, and usually 
much more than this. It is interesting that the performance of the oracle CPSS and oracle Lasso procedures 
are fairly similar. The key advantage of CPSS is that it allows for error control whereas there is in general 
no way of determining (or even approximating) the optimal A* that achieves the required error control. In 
fact, the performance of CPSS with our bound is only slightly worse then that of the oracle Lasso procedure, 
and in a few cases, particularly when p is small, it is even slightly better. In the cases where p > 0.75, we 
see that CPSS is not quite as powerful. This is because having such large correlations between variables 
causes {pfc.[n/2j : k = I, . . . ,p} to be relatively spread out in [0, 1]. As explained in Section 13.41 we expect 
our bound to weaken in this situation. However, even when the correlation is as high as 0.9, we recover a 
sizeable proportion of the signal we would select had we used the optimal r* . 



4.2 Real data example 

Here we illustrate our CPSS methodology on the widely studied colon data set of Alon et al. ( 1999I ). freely 
available at http://microarray.princeton.edu/oncology/affydata/index.html. The data consist of 
2000 gene expression levels from 40 colon tumour samples and 22 normal colon tissue samples, measured using 
Affymetrix oligonucleotide arrays. Our goal is to identify a small subset of genes which we are confident are 
linked with the development of colon cancer. Such a task is important for improving scientific understanding 
of the disease and for selecting genes as potential drug targets. 

The data were first preprocessed by averaging over the expression levels for repeated genes (which had 
been tiled more than once on each array), log-transforming each gene expression level, standardising each 
row to have mean zero and unit variance, and finally removing the columns corresponding to control genes, 
so that p = 1908 genes remained. The transformation and standardisation are very common preprocessing 
steps to reduce skewness in t he data and help eliminate t he eff ects of systematic variatio ns between different 
microarrays (see for example lAmaratunga and Cabreral (|2004[ ) and lDudoit et ali (|2002i )). 
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Figure 4: Linear regression with ?i = 200, p = 1000. The black and grey bars correspond to the worst case and 
r-concave procedures respectively, with higher bars being preferred. The crosses correspond to a theoretical oracle- 
driven Lasso procedure (see the beginning of Section [4. II for further details). The y-axis label gives the error control 
level I. 
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Table 1: Improvement in classification error (%) over the naive classifier which always determines the data to be 
from a cancerous tissue. Thus the classification errors are 33|% minus these quantities. We also give the average 
number of variables selected in parentheses. 



cr 10 



12 



Worst case procedure 
q l^Q.l Z = 0.5 



r-concave procedure 
l = 0.\ 1^0.5 



8 4.9 (0.5) 11.6 (1.1) 16 (2.3) 17.5 (5.1) 
10 0.9 (0.1) 10.6 (0.9) 14.7 (1.6) 15.8 (4.4) 
12 0.0 (0.0) 9.4 (0.8) 12.8 (1.1) 15.8 (4.1) 
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Figure 8: For I = 0.1 (left) and I — 0.5 (right), we have plotted the proportion of times a gene was selected by our 
r-concave CPSS procedure for all genes which were selected at least 5% of the time among the 128 repetitions. Solid 
black means the gene was selected in every repetition, and white means it was never selected. Thus dark vertical 
lines indicate that the choice of q has little effect on the end result of CPSS. 



We applied CPSS with £i (Lasso) penalised logistic regression as the base pr ocedure, with i? = 50, and 



choosi ng r both using the r-concave bound of Section !?^ and the original bound of lMeinshausen and Biihlmann 



( 2010t ). We estimated the expected classification error in the two cases by averaging over 128 repetitions 



of stratified random subsampling validation, taking 8 cancerous and 4 normal observations in each test set. 
Thus when applying CPSS, we had n = 40 -I- 22 - 12 = 50. We looked at q = 8, 10 and 12, and set r to 
control E|5'^PSS n Lg/p\ < I with / = 0.1 and 0.5. 

Rather than subsampling completely at random when using CPSS, we also stratified these subsamples 
to include the same proportion of cancerous to normal samples as in the training data supplied to the 
procedure. Without this step, some of the subsamples may not include any samples from one of the classes, 
and applying >S'[„/2J to such a subsample would give misleading results. Using stratified random subsampling 
is still compatible with our theory, provided that E|5'^'^^^ H L^j is interpreted as an expectation over random 
data which contain the same class proportions as observed in the original data. In general, this approach of 
stratified random subsampling is useful when the response is categorical. 

The results in Table [T] show that, as expected, the new error bo unds allow one to select more variables 
than the conservative bounds of Meinshausen and Biihlmann ( 2010t ) for the same level of error control, and 



as a consequence, the expected prediction error is reduced. Figure [8] demonstrates the robustness of the 
selected set to the different values of q. Finally, we also applied CPSS on the entire dataset with q = 8 
and B = 50 and using the r-concave bound of Section [Ql to choose r to control E|5'^'^^^ n ig/p| < 0.5 (cf. 
Figure in . We see that with just 5 genes out of 1908, we manage to separate the two classes quite well. 

A Appendix 

A.l Proof of Theorem [1] 

The proof of Theorem [1] requires the following lemma. 
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Figure 9: A heatmap of the normalised, centered, log intensity values of the genes selected when we use the r-concave 
bound to choose r such that we control EIS'^,'^^^ n < 0.5. 



Lemma 5. (i) If t £ (5,1]; then 
(a) Ifre [0, i), then 



CPSS\ 



< 



2t- 1 



Pk,l7i/2\ ■ 



1 

1 ~2t' 

Proof, (i) Let A = {{A2j-i,A2j) : j = 1,...,B} be randomly chosen independent pairs of subsets of 
{1, . . . , n} of size [n/2j such that ^2^-1 n A2j = 0. Then 



1 



(9) 



Now E{ns(fc)} = E{E(nB(fc)|^)} = pf, |^^y2j because 5(742^-1) and S{A2j) are independent conditional on 
A. It follows using ^ that 

P(fc e ^^pss) ^ p{n5(fc) > ^} < p| 1 (1 + n5(fc)) >t}= p{nB(fc) > 2t - 1} 

2r"^^'.Ln/2j' (10) 



< 



where we have used Markov's inequality in the final step. 

(ii) Define tig" and Ilg" by replacing Sn with 7V„ := {1, . . . ,p} \ S'n in the definitions of IIb and 
respectively. Then, using the bound corresponding to ([9]) and Markov's inequality again. 



(fc ^ = n^Bik) <t}= P{n^"(fc) > 1 - r} < P{n^"(fc) > 1 - 2r} 



< 



1 - 2t 



(1 -Pfc,Ln/2j)^ 



□ 



Proof of Theorem \T\ 
(i) Note that 



]Sln/2\ n Lg\ ^EiJ^^ikeS^^^^iy'^ip^Ar^m- 



<»} - $I^^'=^L»/2Jl{p.,LV2J<n• 



By Lemma [5l it follows that 



fc=i 



k=l 



< 



2t 



1 

k=l 



/2jl{p.,L./2j<e} < ^ E|5'l„/2J r\Lg\ 



(ii) This proof is very similar to that of (i) and is omitted. 



□ 
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A.2 Proof of Theorem H 

The proof of Theorem [5] requires several prehminary results, and we use the following notation. Let G denote 
the finite lattice {0, -g, -|-, . . . , 1} — -gZ n [0, 1]. If / is a probability mass function on G, we write fi for 
f{i/B), thereby associating / with (/q, /i, . . . , fs) & 

For i e G, we denote the probability that a random variable distributed according to / takes values 
greater than or equal to t by 7t (/) ■— J2i>Bt fi- We also write f (/) :— J2f=i ifi ^^e expectation of this 
random variable and supp(/) :— {i/B G G : fi > 0} for the support of /. 

Let U be the set of all unimodal probability mass functions / on G, and let U,i ^ {f E U : £{f) < ?/}. 
We consider the problem of maximising 7t over f ElAr,. Since the cases 77 = and t <rj are trivial, there is 
no loss of generality in assuming throughout that < rj < t and t G G, so in particular t > 1/B. 

Lemma 6. There exists a maximiser of Tt inUr/. 

Proof. Since 7* : M^+^ ^ M is linear and therefore continuous, it suffices to show that C K-^+^ is 
closed and bounded. Now Ujj is bounded as Ujj C [0, 1]-^+^. Moreover, the hyperplane H = {{xq, . . . ,xb) ■ 
xo + xi + . . . + xb = 1} is closed. Also, £ is a continuous function on R'^^^, so £~^{[0,7]]) is closed. 
Now let O ~ {/ G M^+^ : / is not unimodal}. If / S O then there must exist ii < 12 < is such that 
/i2 < min{/ij, /i,}. Clearly this inequality must hold for all g in a sufficiently small open ball about /, so O 
is open. We see that 

^ Hn£-\[0,ri])nO''. 

Thus is an intersection of closed sets and hence is closed. □ 

We will make frequent use of the following simple proposition in subsequent proofs. 
Proposition 7. Suppose that {xi, . . . , a;„) e K" and {yi, . . . , yn) G R" satisfy 

n n 
i=l i=l 

and that there exists some i* G {1, . . . , n} with Xi > yi for all i < i* and Xi < yi for all i > i* . Then 

n n 

^ixi < ^iyi, 

i=l 1=1 

with equality if and only if Xi — yi for i = 1, . . . , n. 
Proof. We have 

^ i{xi - yi) < j* ^ {xi - yi) = ^ [Vi ~ Xi) <^ i{y^ - Xi). 

iKi* i'>i* 

□ 

The following result characterises the extremal elements of Ur) in the sense of maximising the tail proba- 
bility Tt. In particular, it shows that such extremal elements can take only one of two simple forms. 

Proposition 8. Any maximiser f* G Uri of Tt satisfies 

ft) nn = V, 

(a) writing iM /c" ^ iiiax(supp(/*)), we have either 

(a) /o* > fl = fi = ... = > or 

(b) iM = t and /o* =ft = ... = fl,_, < /*, . 
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min(supp(/*)). As 77 < r, we 



Proof, (i) Suppose /* £ U,^ maximises 7t, but that £{f*) < rj. Define i„ 
must have i,„ < Bt. Define g by 

{0 lii <ira 

fi - ei if i = «m 
/* +62 iii>im 

where ei, £2 > are chosen such that 'Yld=Q9i — but are small enough that E{g) < rj. Then g £lArj but 
Tt{g) > Tt{f*), a contradiction. 

(ii) Suppose first that there exists a mode of /* which is at least t. Let g £ Uri be such that gi — f* for 
i> Bt and gi — J2e=o^ fi * fa — fi — ■ ■ ■ — fst^ '^^ can apply Proposition [7] to see that 



£{g)<£{f*). 



(11) 



But Tt{g) = Tt{f*), so by optimality of /* we must have equality in (fTTj) . Thus Proposition [7] gives us that 

Next, define h £ hy ^ f* for i < Bt, hgt = Tt{f*), and /i, = for i > Bt. Then Ttih) = %{!*)■ 
Again Proposition [7] and the optimality of /* give that /* — h. Thus /* satisfies property (ii)(b) of the 
theorem. 

Now suppose that there is no mode of /* which is at least t, so fg^ > fst+i — ■ ■ ■ — fs- L'^t g £ Uri 
satisfy gi = f* for i > Bt and gi = . . . ^ gst- We must have go > gi, otherwise /* would have a mode at t. 
As Tt{g) = Tt{f*), optimality of /* and Proposition [7] imply /* = g. 

Finally, let h £lAri satisfy hi = f* for i < Bt and hst = hst+i = . ■ . = h^-i > hk, where k and hk are 
chosen such that X]^o ^i — I. As before. Proposition [7] allows us to deduce that /* — h. Thus /* satisfies 
property (ii)(a) of the theorem. □ 

We are now in a position to state Markov's inequality for random variables with unimodal distributions 
on G, which may be of some independent interest. 

Theorem 9 (Markov's inequality under unimodality) . Let X be a random variable with a unimodal distri- 
bution on G = {0, -i, -|, . . . , 1}, and let t £ G. If r] := E(X) < 1/3, then 



\X >t) < < 



2t- ^ 

277(1 -t 



1 + 5 



if t£ {rj, min (§77+ 2^,277)] 
if t£ (min (§77 +2^,277), \] 

(i, 1]. 



Let d be defined by 



d d{ij, B) ^-2{t]- i) (677 + 1) + 
If rj > 1/3 and d> 0, then 



2-477 (4?7-l)2 



B 



B' 



F(X >t) < 
Finally, if rj > 1/3 and d < 0, then 



t+i 
277(1 -t+- 

1 + i 



^^(77, i + ^(l + i-dl/2) 

^ea+i(i+i-rf^/').i 



"{x >t)< 



t + 
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Proof. Proposition [5] tells us that ]P{X > t) must be at most the maximum of the optimal solutions to the 
following two optimisation problems: 

(P): Maximise b{s — Bt) + c in a, b,c, s (Q); Maximise 6 in a, & 

subject to a + {s — 1)6 + c = 1 subject to Bta + b — 1 

f (s - 1)6 + sc = S?7 ^{Bt-l)a + Btb = Bf] 

a>b>c>0 b>a>0. 
se {Bt,Bt + l,...,B} 

Problem (P) corresponds to case (ii)(a) of Proposition [U and problem (Q) to case (ii)(b). 

The solution to (Q) is determined entirely by the constraints, and we see that the optimal value is 

To solve (P), we break it into 5(1 — t) + l subproblcms: for s E {Bt, Bt + 1, . . . , B}, we define subproblem 
(P(s)) as follows: 

(P(s)): Maximise b{s — Bt) + c in a, b, c 
subject to a + (s — 1)6 + c = 1 
f (s - 1)6 + sc = P77 
6 > c, 
a,b,c > 0. 

Notice that we have not included the a > b constraint. This is because Proposition [5] ensures that this 
constraint is always satisfied at an optimal solution of (P), so there exists s* such that every optimal 
solution of (P(s*)) corresponds to an optimal solution of (P). 

Now each subproblem is a standard linear programming problem, so we know that one of the basic 
feasible solutions must be optimal. Since a > 0, all basic feasible solutions must have either c — or 6 = c. 
Thus we may replace the subproblcms (P(s)) by 

(P'(s)): Maximise 6(s -Bt + 1) in a, 6 
subject to a + s6 = 1 

^{s + l)b = Br] 
a,b>0. 

The second constraint is enough to determine that the optimal value of P'(s) is 

2Bt]{s - Bt + l 



s{s + l) 



=:7(s). (13) 



Now we can proceed to find an s* which maximises 7 over {Bt, Bt + 1, ... , B}. The sign of 7'(s) is the sign 
of 

-s^ + 2{Bt- l)s + Bt- 1. 

This quadratic in s has roots 

Bt-1± y^iBt-l)^ + Bt -1. 

So 7(5) is increasing for all s £ {Bt, Bt + 1, ... , B} with 



s<Pt-l + V(P<-i)'-i=:5o. (14) 

When So < B, we must have s* S {2Bt — 2, 2Bt — 1}. In fact, by examining ([13]), we see that ^{2Bt — 2) = 
j{2Bt — 1). Also, from ([Tl)) . we see that when t > 1/2, we have that Sq > B, so s* = B. So far, we have 
shown that 

V{X >t)< max(6i,62,63). 
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where bounds bi , 62 and 63 are given by 



bl :— bi{t,ri,B) — , , \_ l{))<t<min(2r,,l)} 

bi := b2{t,T],B) ^ ^ l{,Kt<i/2} 

zt -g 

O3 ■— 03[t,ri,IJ) — j_ ■''■{max(»),l/2)<t<l}- 

AU that remains now is to determine which of bi , 62 and 63 have the largest value. We first consider the 
case when f] < ^. When t < min(l/2, 277), 

sgn(62 - 61) = sgn { (i - fry - (i - i) } . 

Now for 1/2 < i < 2?7, 

dbs _ 277 ^ (2?7+|) ^ 96i 

Furthermore, 

^3 (i + 2^,r;,S) = , > ^!^^i±M = 6, (i + . 

2 ^ 2S 

Putting this together gives the required bound for 77 < 1/3. 

When 77 > 1/3, we can ignore 62 as it is dominated by bi. Comparing bi and 63, we get the final cases of 
the bound. □ 

Proof of Theorem [2] 

Recalling that E{nB(fc)} — pj, y^/i] ' '^^ follow the proof of LemmajSJ but apply TheoremlH] at the last step 
of with t = 2t — 1 to deduce that if the distribution of lis (A:) is unimodal, then 



e ^^PSS) < p{n^(fc) > 2t - 1} < C(r, B)p' 



k,[rL/2\ ' 



where C{t,B) is given in the statement of Theorem [5] The bound for E|S'^^^^^ n Lgj then follows in the 
same way that Theorem [T] follows from Lemma [5] □ 



A. 3 Proofs of results on r-concavity 

Proof of Proposition [3] 

Suppose that / is log-concave, so we may write / — e^"^ where (/) is a convex function. If r < 0, then —r<j> is 
convex, and as the exponential function is increasing and convex, /'' — e"'"'^ is convex. 

Conversely, suppose that / is not log-concave, so there exist x,y and A G (0, 1) with f{\x + (1 — X)y) < 
fixYf{y?~^- Then as Mr{f {x) J {y); X) f{x)^f{yf~^ as r ^ 0, we must have f{\x + (1 - \)y) < 
Mr{f{x), f{y); A) for some r < 0, and so / cannot be r-concave. □ 
Proof of Proposition H] 

Let / = {1, . . . ,1}\J {u, . . . , B — 1}. The conditions on / imply that 

/j > min{/i_i, /i+i}, i e I. 

Then as Mr{fi-i, /j+i, ^) — > min{/i_i, fi+i} as r — >■ —00, for each i £ I, may choose an < with 

/, >M,,(/,_i,/,+i;i). (15) 

Set r = minig/r^. Observe that as Mj.{a,b; ^) is increasing in r for all fixed a and b, the inequalities p5p 
are all satisfied when — r. Thus f[ < ^{fl^i + fi+i) for all i G {1, . . . , i? — 1}, so / is r-concave. □ 
By analogy with the unimodal case, let Jv.,, = {/ G Jv : £{f) < rj}. In maximising 7t over J>,,,, there is 
again no loss of generality in assuming < rj < t. 
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Lemma 10. For each r < 0, there exists a maximiser ofTt in J>,,, . 

Proof. This proof is almost identical to that of LemmalHl except here we let O = {/ G : is not convex}. 

If / G O, then there must exist «i<i2<*3 such that 

{^:^-^2)fl+{^2-^l)fl<{i3-il)fl 

and it is clear that the above inequality must hold for all g in a sufficiently small open ball about /. Thus 
O is open, and the rest of the proof is clear. □ 

Proposition 11. Any maximiser f* £ Jv ,j o/7t satisfies 

(^) nn = V 

(a) /*'' is linear between /q'" and f*[j_-^, where im = i?max(supp(/*)). 

Proof, (i) Suppose that S{f*) < rj. Define i,n ■= -B min(supp(/*)). Let — /*'" and define a new sequence 
^■.= {^,:i = 0,...,B)hy 

{oo if Z < im 

0i + ei if z = 
0i - 62 if Z > ?m 

where ei, £2 > are chosen such that J2f=o''py^ — 1' but are small enough that < rj. Then ip is 

convex, so V^^' G J'r,'n- Since t] > 0, we must have Tt{f*) > so max(supp(/*)) > t. Also, as we are 
assuming 77 < t, we must have i„i < t. Therefore Itiij^^^) > 7t(/*), which is a contradiction. 

(n) Set = /*^ so (j> is convex and ^^Z'' = /*. Define V' = (i/'o,--->b) G as follows. Take 

Tpl = (pi for i > Bt, but make ijj' linear between ipQ and ^'g^ such that g :— tp''^^^ has X]^o5* ~ ^ ''^^'^ 
go > 0. This is possible since £{f*) < rj < t, so min(supp(/*)) < t. Note that tp' is still convex since we must 
have Vst - fp'st-^i < 0Bt - (f'Bt-i- Also Tt{g) = Tt{f*). Applying Proposition [3 we see that £{g) < £{f*). 
Optimality of /* means that equality must hold, so /* — g and also (p — ip'. 

Now if (p is in fact linear between 0o and (pB, condition (ii) of the theorem is satisfied and we are done. 
Otherwise we may assume p is not a linear function between pBt-i and pB and we can define ip such that 
ipi = pi for i < Bt, that -0 is linear between ipBt-i and ipk-i and ipi — 00 for i > k. Here, k is chosen 
such that g :— ip^/^ has X]^o5« ~ 1' ^^'^ convexity of </> ensures that such a k < B exists. Applying 
Proposition [3 we see that £{g) < £{f*). Since Tt{g) — Tt{f*), as before, optimality of /* allows us to 
conclude that f* — g. □ 

A. 4 Computing the r-concave tail probability bound 

Here we describe a numerical algorithm that computes the function D defined in Section 13.31 Note that 
this is the maximum of 7t{f) over / e Jv,,,. We shall only discuss the case where /* is decreasing, as is 
always the case when t > 2rj. The increasing case is very similar and less important for our application. We 
first note that we may parametrise the r-concave probability mass functions whose r**^ powers are linear as 
follows: 

/a... = -f-^-— , Z = 0,l,...,fc (16) 

where k < B. As £{fa,k) is strictly increasing in a, for each k, there is a unique au for which £{fak,k) = V- 
We also note here that decreases with k. This is easily seen by observing that, regardless of the value of 
fc, the parameter a in (jl6p determines the ratio of fa,k-i to fa,k;j, each z, j. 

According to Proposition 111! if /* G J>^^ maximises 7t, then f*"^ is linear up to its penultimate support 
point. We can parametrise these in the following way. Write 

Eti»(a + »)^/'- + (fc + l)c _ 
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and then solve for c: 

sr/E;=o(«+jr/'--Eti*(«+* 



c = c(a, k) — — ' 



k + l-B-q 

We see that as a ranges through [afe+i,afc], we obtam all the relevant probability mass functions supported 
on 0, 1, . . . , fc + 1 via 



{a + i) 



l/r 



ga,k-k+l 



E-=o(« + jT/'- + c(a,fc) 
c(a, fc) 



z = 0,l,.. 



E-=o(« + jT/'- + c(a,fc)' 
The tail probability of ga,k, when the threshold is t, is 



and we may maximise this over a e [ak+i, Ofe] to obtain an optimal for each fc. This is easily accomplished 
using a general purpose optimiser such as optimize in R. To summarise, we have the following simple 
procedure for computing 7t(/*). 

1. For each k G {t, . . . , B}, determine (numerically), the solution in to £{fa,k) = V- 

2. Find a* := argmax„g[„^^^^„^] Tt{ga,k), for each fc. 

3. Let fc*(t) := argmaxj, 7t(ga*,fc)- 

Then Tt{f*) = Ttiga', ,fe'(t))- When we wish to evaluate Tt{f*) for a range of values of t, the process is 

k (t) ' 

simplified by the observation that k*{t) is increasing in i, and thus in Step 2 we need only consider those fc 
which are at least k*{t — 1/B). 

Using the algorithm described above, we have computed 

iam{D{e^, 2r - 1, 50, -1/2), D{e, r, 100, -1/4)} 

over a grid of 9 and t values (cf. Tables [5] and [3]) . An R implementation of the algorithm is available from 
both authors' websites. 
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Table 2: Table of values of min{D(6l^, 2t - 1, 50, -1/2), D{e, r, 100, -1/4)} for 9 € {0.01, 0.02, 0.03, 0.04, 0.05}. 
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Table 3: Table of values of min{£)(6»^ 2t - 1, 50, -1/2), D{e,T, 100, -1/4)} for 6 € {0.06, 0.07, 0.08, 0.09, 0.1}. 
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